Abstract

We discuss data representations which can be learned automatically from data,
are invariant to transformations, and at the same time selective, in the sense
that two points have the same representation only if one is a transformation
of the other. The mathematical results here sharpen some of the key claims of
i-theory, a recent theory of feedforward processing in sensory cortex [?].

1 Introduction

This paper considers the problem of learning "good" data representations which
can lower the need for labeled data (sample complexity) in machine learning
(ML). Indeed, while current ML systems have achieved impressive results in a
variety of tasks, an obvious bottleneck appears to be the huge amount of labeled
data needed. This paper builds on the idea that data representations which are
learned in an unsupervised manner can be key to solving the problem. Classical
statistical learning theory focuses on supervised learning and postulates that a
suitable hypothesis space is given. In turn, under very general conditions, the
latter can be seen to be equivalent to a data representation. In other words,
the data representation, and how to select and learn it, is classically not
considered to be part of the learning problem, but rather prior information. In
practice, ad hoc solutions are often empirically found for each problem.

The study in this paper is a step towards developing a theory of learning
data representations. Our starting point is the intuition that, since many
learning tasks are invariant to transformations of the data, learning invariant
representations from "unsupervised" experiences can significantly lower the
"size" of the problem, effectively decreasing the need for labeled data. In the
following, we formalize the above idea and discuss how such invariant
representations can be learned. Crucial to our reasoning is the requirement for
invariant representations to satisfy a form of selectivity, broadly referred to
as the property of distinguishing images which are not a transformation of one
another. Indeed, it is this latter requirement that informs the design of
non trivial invariant representations. Our work is motivated by a theory of
cortex, and in particular visual cortex.

Data representation is a classical concept in harmonic analysis and signal
processing. Here representations are typically designed on the basis of prior
information assumed to be available. More recently, there has been an effort
to automatically learn adaptive representations on the basis of data samples.
Examples in this class of methods include so called dictionary learning [50],
autoencoders [5] and metric learning techniques (see e.g. [?]). The idea of
deriving invariant data representations has been considered before, for example
in the analysis of shapes [?] and more generally in computational topology [?],
or in the design of positive definite functions associated to reproducing kernel
Hilbert spaces [?]. However, in these lines of study the selectivity properties
of the representations have hardly been considered. The ideas in [22, 28] are
close in spirit to the study in this paper. In particular, the results in [22]
develop a different invariant and stable representation within a signal
processing framework. In [28] an information theoretic perspective is
considered to formalize the problem of learning invariant/selective
representations.

In this work we develop a machine learning perspective closely following
computational neuroscience models of the information processing in the visual
cortex [?]. Our first and main result shows that, for compact groups,
representations defined by nonlinear group averages are invariant, as well as
selective, to the action of the group. While invariance follows from the
properties of the Haar measure associated to the group, selectivity is shown
using probabilistic results that characterize a probability measure in terms of
one dimensional projections. This set of ideas, which forms the core of the
paper, is then extended to local transformations and multilayer architectures.
These results shed some light on the nature of certain deep architectures, in
particular neural networks of the convolutional type.

The rest of the paper is organized as follows. We describe the concepts of
invariant and selective representations in Section 2 and their role for learning
in Section 3. We discuss a family of invariant/selective representations for
transformations which belong to compact groups in Section 4, which we further
develop in the sections that follow. Finally, we conclude with some final
comments.

2 Invariant and Selective Data Representations

We next formalize and discuss the notion of invariant and selective data
representation, which is the main focus of the rest of the paper.

We model the data space as a (real separable) Hilbert space $\mathcal{I}$ and
denote by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ the inner product and
norm, respectively. Examples of data spaces are one dimensional signals (as in
audio data), where we could let $\mathcal{I} \subset L^2(\mathbb{R})$, or two
dimensional signals (such as images), where we could let
$\mathcal{I} \subset L^2(\mathbb{R}^2)$. After discretization, data can often be
seen as vectors in high-dimensional Euclidean spaces, e.g.
$\mathcal{I} = \mathbb{R}^d$. The case of (digital) images serves as a main
example throughout the paper.

A data representation is a map from the data space to a suitable
representation space, that is

$$\mu : \mathcal{I} \to \mathcal{F}.$$

Indeed, the above concept appears under different names in various branches of
pure and applied sciences, e.g. it is called an encoding (information theory), a
feature map (learning theory), a transform (harmonic analysis/signal
processing) or an embedding (computational geometry).

In this paper, we are interested in representations which are invariant (see 
below) to suitable sets of transformations. The latter can be seen as a set of 
maps 

$$\mathcal{G} \subset \{ g \mid g : \mathcal{I} \to \mathcal{I} \}.$$

Many interesting examples of transformations have a group structure. Recall 
that a group is a set endowed with a well defined composition/multiplication 
operation satisfying four basic properties, 

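• closure: $gg' \in \mathcal{G}$, for all $g, g' \in \mathcal{G}$;

• associativity: $(gg')g'' = g(g'g'')$, for all $g, g', g'' \in \mathcal{G}$;

• identity: there exists $\mathrm{Id} \in \mathcal{G}$ such that $\mathrm{Id}\, g = g\, \mathrm{Id} = g$, for all $g \in \mathcal{G}$;

• invertibility: for all $g \in \mathcal{G}$ there exists $g^{-1} \in \mathcal{G}$ such that $g g^{-1} = \mathrm{Id}$.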

There are different kinds of groups. In particular, there are "small" groups,
such as compact or locally compact groups (a locally compact group being one
that admits a locally compact Hausdorff topology such that the group operations
of composition and inversion are continuous), and "large" groups which are not
locally compact. In the case of images, examples of locally compact groups
include affine transformations (e.g. scaling, translations, rotations and their
combinations), which can be thought of as suitable viewpoint changes. Examples
of non locally compact groups are diffeomorphisms, which can be thought of as
various kinds of local or global deformations.

Example 1. Let $I \in L^2(\mathbb{R})$. A basic example of group transformation
is given by the translation group, which can be represented as a family of
linear operators

$$T_\tau : L^2(\mathbb{R}) \to L^2(\mathbb{R}), \quad T_\tau I(p) = I(p - \tau), \quad \forall p \in \mathbb{R},\ I \in \mathcal{I},$$

for $\tau \in \mathbb{R}$. Other basic examples of locally compact groups include
scaling (the multiplication group) and affine transformations (affine group).
Given a smooth map $d : \mathbb{R} \to \mathbb{R}$, a diffeomorphism can also be
seen as a linear operator given by

$$D_d : L^2(\mathbb{R}) \to L^2(\mathbb{R}), \quad D_d I(p) = I(d(p)), \quad \forall p \in \mathbb{R},\ I \in \mathcal{I}.$$

Note that also in this case the representation is linear.
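As a rough numerical illustration of Example 1, the following sketch discretizes a signal and uses cyclic shifts in place of translations on the real line. The periodic boundary, the signal length and the helper name `translate` are assumptions made only for this illustration.

```python
import numpy as np

def translate(signal, tau):
    """Cyclic translation T_tau I(p) = I(p - tau) on a sampled, periodic signal."""
    return np.roll(signal, tau)

rng = np.random.default_rng(0)
I1, I2 = rng.standard_normal(8), rng.standard_normal(8)
a, b, tau = 2.0, -3.0, 3

# Linearity: T(a*I1 + b*I2) = a*T(I1) + b*T(I2)
linear = np.allclose(translate(a * I1 + b * I2, tau),
                     a * translate(I1, tau) + b * translate(I2, tau))
# Group structure: composing shifts adds the parameters, and -tau inverts tau
composes = np.allclose(translate(translate(I1, 2), 3), translate(I1, 5))
inverts = np.allclose(translate(translate(I1, tau), -tau), I1)
# Unitarity: the norm is preserved by the transformation
unitary = np.isclose(np.linalg.norm(translate(I1, tau)), np.linalg.norm(I1))
```

All four flags evaluate to `True` in this discrete setting, which mirrors the linearity and group properties stated above.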

Clearly, not all transformations have a group structure; think for example of
images obtained from three dimensional rotations of an object.

Given the above premise, we next discuss properties of data representations with
respect to transformations. We first add one remark about the notation.

Remark 1 (Notation: Group Action and Representation). If $\mathcal{G}$ is a
group and $\mathcal{I}$ a set, the group action is the map
$(g, x) \mapsto g.x \in \mathcal{I}$. In the following, with an abuse of notation
we will denote by $gx$ the group action. Indeed, when $\mathcal{I}$ is a linear
space, we also often denote by $g$ both a group element and its representation,
so that $g$ can be identified with a linear operator. Throughout the article we
assume the group representation to be unitary.

To introduce the notion of invariant representation, we recall that an orbit
associated to an element $I \in \mathcal{I}$ is the set
$O_I \subset \mathcal{I}$ given by
$O_I = \{ I' \in \mathcal{I} \mid I' = gI,\ g \in \mathcal{G} \}$. Orbits form a
partition of $\mathcal{I}$ in equivalence classes, with respect to the
equivalence relation

$$I \sim I' \iff \exists\, g \in \mathcal{G} \text{ such that } gI = I',$$

for all $I, I' \in \mathcal{I}$. We have the following definition.

Definition 1 (Invariant Representation). We say that a representation $\mu$ is
invariant with respect to $\mathcal{G}$ if

$$I \sim I' \Rightarrow \mu(I) = \mu(I'),$$

for all $I, I' \in \mathcal{I}$.

In words, the above definition states that if two data points are one the
transformation of the other, then they will have the same representation.
Indeed, if a representation $\mu$ is invariant,

$$\mu(I) = \mu(gI)$$

for all $I \in \mathcal{I}$, $g \in \mathcal{G}$. Clearly, trivial invariant
representations can be defined, e.g. the constant function. This motivates a
second requirement, namely selectivity.

Definition 2 (Selective Representation). We say that a representation $\mu$ is
selective with respect to $\mathcal{G}$ if

$$\mu(I) = \mu(I') \Rightarrow I \sim I',$$

for all $I, I' \in \mathcal{I}$.

Together with invariance, selectivity asserts that two points have the same
representation if and only if they are one a transformation of the other.
Several comments are in order. First, the requirement of exact invariance as in
Definition 1 seems desirable for (locally) compact groups, but not for non
locally compact groups such as diffeomorphisms. In this case, requiring a form
of stability to small transformations seems to be natural, as it is more
generally to require stability to small perturbations, e.g. noise (see [22]).
Second, the concept of selectivity is natural and requires that no two orbits
are mapped to the same representation. It corresponds to an injectivity property
of a representation on the quotient space $\mathcal{I}/\sim$. Assuming
$\mathcal{F}$ to be endowed with a metric $d_{\mathcal{F}}$, a stronger
requirement would be to characterize the metric embedding induced by $\mu$, that
is to control the ratio (or the deviation) of the distance of two
representations and the distance of two orbits. Indeed, the problem of finding
invariant and selective representations is tightly related to the problem of
finding an injective embedding of the quotient space $\mathcal{I}/\sim$.
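To make the difference between Definitions 1 and 2 concrete, the sketch below compares two toy representations for the group of cyclic shifts acting on short vectors. Both representations (`mu_sum` and `mu_canonical`) and the finite group are illustrative assumptions and are not the representations studied in this paper.

```python
import numpy as np

def orbit(I):
    """All cyclic shifts of I: the orbit of I under the cyclic group Z_n."""
    return [np.roll(I, s) for s in range(len(I))]

def mu_sum(I):
    """Invariant but not selective: many distinct orbits share the same sum."""
    return float(np.sum(I))

def mu_canonical(I):
    """Invariant and selective: the lexicographically smallest orbit element."""
    return min(tuple(x.tolist()) for x in orbit(I))

I = np.array([1, 2, 3, 4])
I_shifted = np.roll(I, 1)          # same orbit as I
I_other = np.array([1, 3, 2, 4])   # different orbit, same sum as I

sums = (mu_sum(I), mu_sum(I_shifted), mu_sum(I_other))
canon = (mu_canonical(I), mu_canonical(I_shifted), mu_canonical(I_other))
```

The sum cannot tell `I` and `I_other` apart, whereas the canonical orbit element agrees on `I` and `I_shifted` and differs on `I_other`. Such a canonical form requires enumerating the whole orbit, which is one reason to look for representations built from simpler measurements.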

We next provide a discussion of the potential impact of invariant
representations on the solution of subsequent learning tasks.

3 From Invariance to Low Sample Complexity

In this section we first recall how the concepts of data representation and
hypothesis space are closely related, and how the sample complexity of a
supervised problem can be characterized by the covering numbers of the
hypothesis space. Then, we discuss how invariant representations can lower the
sample complexity of a supervised learning problem.

Supervised learning amounts to finding an input-output relationship on the
basis of a training set of input-output pairs. Outputs can be scalar or vector
valued, as in regression, or categorical, as in multi-category or multi-label
classification, binary classification being a basic example. The bulk of
statistical learning theory is devoted to studying conditions under which
learning problems can be solved, approximately and up to a certain confidence,
provided a suitable hypothesis space is given. A hypothesis space is a subset

$$\mathcal{H} \subset \{ f \mid f : \mathcal{I} \to \mathcal{Y} \}$$

of the set of all possible input-output relations. As we comment below, under
very general assumptions hypothesis spaces and data representations are
equivalent concepts.

3.1 Data Representation and Hypothesis Space 

Indeed, practically useful hypothesis spaces are typically endowed with a Hilbert
space structure, since it is in this setting that most computational solutions can
be developed. A further natural requirement is for evaluation functionals to be
well defined and continuous. This latter property gives a well defined meaning
to the evaluation of a function at every point, a property which is arguably
natural since we are interested in making predictions. The requirements of
1) being a Hilbert space of functions and 2) having continuous evaluation
functionals define so called reproducing kernel Hilbert spaces (RKHS) [?]. Among
other properties, these spaces of functions are characterized by the existence
of a feature map $\mu : \mathcal{I} \to \mathcal{F}$, which is a map from the data
space into a feature space which is itself a Hilbert space. Roughly speaking,
functions in an RKHS $\mathcal{H}$ with an associated feature map $\mu$ can be
seen as hyperplanes in the feature space, in the sense that for all
$f \in \mathcal{H}$ there exists $w \in \mathcal{F}$ such that

$$f(I) = \langle w, \mu(I) \rangle_{\mathcal{F}}, \quad \forall I \in \mathcal{I}.$$

The above discussion illustrates how, under mild assumptions, the choice of a
hypothesis space is equivalent to the choice of a data representation (a feature
map). In the next section, we recall how hypothesis spaces, hence data
representations, are usually assumed to be given in statistical learning theory
and are characterized in terms of sample complexity.

3.2 Sample Complexity in Supervised Learning 

Supervised statistical learning theory characterizes the difficulty of a learning
problem in terms of the "size" of the considered hypothesis space, as measured
by suitable capacity measures. More precisely, given a measurable loss function
$V : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$, for any measurable function
$f : \mathcal{I} \to \mathcal{Y}$ the expected error is defined as

$$\mathcal{E}(f) = \int V(f(I), y)\, d\rho(I, y),$$

where $\rho$ is a probability measure on $\mathcal{I} \times \mathcal{Y}$. Given a
training set $S_n = \{(I_1, y_1), \dots, (I_n, y_n)\}$ of input-output pairs
sampled identically and independently with respect to $\rho$, and a hypothesis
space $\mathcal{H}$, the goal of learning is to find an approximate solution
$f_n = f_{S_n} \in \mathcal{H}$ to the problem

$$\inf_{f \in \mathcal{H}} \mathcal{E}(f).$$

The difficulty of a learning problem is captured by the following definition. 

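Definition 3 (Learnability and Sample Complexity). A hypothesis space
$\mathcal{H}$ is said to be learnable if, for all $\epsilon \in [0, \infty)$,
$\delta \in [0, 1]$, there exists
$n(\epsilon, \delta, \mathcal{H}) \in \mathbb{N}$ such that

$$\inf_{f_n} \sup_{\rho} \mathbb{P}\left( \mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f) \ge \epsilon \right) \le \delta. \qquad (1)$$

The quantity $n(\epsilon, \delta, \mathcal{H})$ is called the sample complexity
of the problem.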

The above definition characterizes the complexity of the learning problem
associated to a hypothesis space $\mathcal{H}$, in terms of the existence of an
algorithm that, provided with at least $n(\epsilon, \delta, \mathcal{H})$
training set points, can approximately solve the learning problem on
$\mathcal{H}$ with accuracy $\epsilon$ and confidence $\delta$.

The sample complexity associated to a hypothesis space $\mathcal{H}$ can be
derived from suitable notions of covering numbers, and related quantities, that
characterize the size of $\mathcal{H}$. Recall that, roughly speaking, the
covering number $N_\epsilon$ associated to a (metric) space is defined as the
minimal number of $\epsilon$ balls needed to cover the space. The sample
complexity can be shown [?] to be proportional to the logarithm of the covering
number, i.e.

$$n(\epsilon, \delta, \mathcal{H}) \propto \log N_\epsilon.$$

As a basic example, consider $\mathcal{I}$ to be $d$-dimensional and a hypothesis
space of linear functions

$$f(I) = \langle w, I \rangle, \quad \forall I \in \mathcal{I},\ w \in \mathcal{I},$$

so that the data representation is simply the identity. Then the
$\epsilon$-covering number of the set of linear functions with $\|w\| \le 1$ is
given by

$$N_\epsilon \sim \epsilon^{-d}.$$

If the input data lie in a subspace of dimension $s \le d$ then the covering
number of the space of linear functions becomes $N_\epsilon \sim \epsilon^{-s}$.
In the next section, we further comment on the above example and provide an
argument to illustrate the potential benefits of invariant representations.

3.3 Sample Complexity of the Invariance Oracle 

Consider the simple example of a set of images of $p \times p$ pixels, each
containing an object within a (square) window of $k \times k$ pixels and
surrounded by a uniform background. Imagine the object positions to be possibly
anywhere in the image. Then it is easy to see that as soon as objects are
translated so that they do not overlap we get an orthogonal subspace. Then, we
see that there are $(p/k)^2$ possible subspaces of dimension $k^2$, that is the
set of translated images can be seen as a distribution of vectors supported
within a ball in $d = p^2$ dimensions. Following the discussion in the previous
section, the best algorithm based on a linear hypothesis space will incur a
sample complexity proportional to $d$. Assume now to have access to an oracle
that can "register" each image so that each object occupies the centered
position. In this case, the distribution of images is effectively supported
within a ball in $s = k^2$ dimensions and the sample complexity is proportional
to $s$ rather than $d$. In other words, a linear learning algorithm would need

$$(p/k)^2 = d/s$$

times fewer examples to achieve the same accuracy. The idea is that invariant
representations can act as an invariance oracle, and have the same impact on the
sample complexity. We add a few comments. First, while the above reasoning
is developed for linear hypothesis spaces, a similar conclusion holds if non
linear hypothesis spaces are considered. Second, one can see that the set of
images obtained by translation is a low dimensional manifold, embedded in a very
high dimensional space. Other transformations, such as small deformations, while
being more complex, would have a much milder effect on the dimensionality of the
embedded space. Finally, the natural question is how invariant representations
can be learned, a topic we address next.
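The counting argument above can be checked on a small instance. The sizes `p = 32` and `k = 8`, the zero-valued background and the helper `place` are hypothetical choices made only for this sketch.

```python
import numpy as np

p, k = 32, 8                     # hypothetical image and window sizes
d, s = p ** 2, k ** 2            # ambient dimension and object-window dimension
num_windows = (p // k) ** 2      # non-overlapping placements of the window
gain = d / s                     # reduction factor in sample complexity

def place(obj, p, row, col):
    """Embed a k x k object in a p x p image with zero (uniform) background."""
    img = np.zeros((p, p))
    img[row:row + obj.shape[0], col:col + obj.shape[1]] = obj
    return img.ravel()

obj = np.arange(1.0, k * k + 1).reshape(k, k)
x0 = place(obj, p, 0, 0)
x1 = place(obj, p, 0, k)         # translated so that the windows do not overlap
overlap = float(x0 @ x1)         # inner product of the two translated images
```

Here `d = 1024`, `s = 64`, and both `num_windows` and `gain` equal 16, while `overlap` is zero: the two non-overlapping translates are orthogonal, as claimed in the text.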

4 Compact Group Invariant Representations 

Consider a set of transformations $\mathcal{G}$ which is a locally compact group.
Recall that each locally compact group has a measure naturally associated to
it, the so called Haar measure. The key feature of the Haar measure is its
invariance to the group action, and in particular for all measurable functions
$f : \mathcal{G} \to \mathbb{R}$, and $g' \in \mathcal{G}$, it holds

$$\int dg\, f(g) = \int dg\, f(g'g).$$

The above equation is reminiscent of the invariance to translation of Lebesgue
integrals and indeed, the Lebesgue measure can be shown to be the Haar measure
associated to the translation group. The invariance property of the Haar
measure associated to a locally compact group is key to our development of
invariant representations, as we describe next.
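For a finite group the normalized Haar measure is simply the uniform measure, so the invariance property can be verified directly. The sketch below uses the cyclic group $\mathbb{Z}_n$ with $n = 12$ and an arbitrary test function; these choices, and the name `haar_integral`, are assumptions for illustration only.

```python
import numpy as np

n = 12                       # order of the cyclic group Z_n (illustrative choice)
group = np.arange(n)         # elements 0, ..., n-1, composed by addition mod n

def haar_integral(f):
    """Integral of f against the normalized Haar (uniform) measure on Z_n."""
    return float(np.mean([f(g) for g in group]))

f = lambda g: np.cos(g) ** 3 + g      # an arbitrary function on the group
g_prime = 5

lhs = haar_integral(f)
rhs = haar_integral(lambda g: f((g_prime + g) % n))   # integrand f(g'g)
```

The two values `lhs` and `rhs` agree up to floating point rounding, since composing with `g_prime` only permutes the group elements.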

4.1 Invariance via Group Averaging 

The starting point for deriving invariant representations is the following direct 
application of the invariance property of the Haar measure. 

Theorem 1. Let $\psi : \mathcal{I} \to \mathbb{R}$ be a, possibly non linear,
functional on $\mathcal{I}$. Then, the functional defined by

$$\mu(I) = \int dg\, \psi(gI), \quad I \in \mathcal{I}, \qquad (2)$$

is invariant in the sense of Definition 1.

The functionals $\psi, \mu$ can be thought of as measurements, or features, of
the data. In the following we are interested in measurements of the form

$$\psi(I) = \eta(\langle gI, t \rangle), \quad I \in \mathcal{I},\ g \in \mathcal{G}, \qquad (3)$$

where $t \in \mathcal{T} \subseteq \mathcal{I}$, the set of unit vectors in
$\mathcal{I}$, and $\eta : \mathbb{R} \to \mathbb{R}$ is a possibly non linear
function. As discussed in [?], the main motivation for considering measurements
of the above form is their interpretation in terms of biological or artificial
neural networks, see the following remarks.

Remark 2 (Hubel and Wiesel Simple and Complex Cells [2]). A measurement
as in (3) can be interpreted as the output of a neuron which computes a possibly
high-dimensional inner product with a template $t \in \mathcal{T}$. In this
interpretation, $\eta$ can be seen as a so called activation function, for which
natural choices are sigmoidal functions, such as the hyperbolic tangent, or
rectifying functions such as the hinge. The functional $\mu$, obtained plugging
(3) in (2), can be seen as the output of a second neuron which aggregates the
output of other neurons by a simple averaging operation. Neurons of the former
kind are similar to simple cells, whereas neurons of the second kind are similar
to complex cells in the visual cortex.

Remark 3 (Convolutional Neural Networks [20]). The computation of a
measurement obtained plugging (3) in (2) can also be seen as the output of a so
called convolutional neural network where each neuron, $\psi$, performs the
inner product operation between the input, $I$, and its synaptic weights, $t$,
followed by a pointwise nonlinearity $\eta$ and a pooling layer.

A second reason to consider measurements of the form (3) is computational
and, as shown later, has direct implications for learning. Indeed, to compute
an invariant feature according to (2), it is necessary to be able to compute
the action of any element $g \in \mathcal{G}$ on the point $I$ for which we wish
to compute the invariant measurement. However, a simple observation suggests an
alternative strategy. Indeed, since the group representation is unitary, then

$$\langle gI, I' \rangle = \langle I, g^{-1} I' \rangle, \quad \forall I, I' \in \mathcal{I},$$

so that in particular we can compute $\mu$ by considering

$$\mu(I) = \int dg\, \eta(\langle I, gt \rangle), \quad \forall I \in \mathcal{I}, \qquad (4)$$

where we used the invariance of the Haar measure. The above reasoning implies
that an invariant feature can be computed for any point provided that, for
$t \in \mathcal{T}$, the sequence $gt$, $g \in \mathcal{G}$, is available. This
observation has the following interpretation: if we view a sequence $gt$,
$g \in \mathcal{G}$, as a "movie" of an object undergoing a family of
transformations, then the idea is that invariant features can be computed for
any new image provided that a movie of the template is available.
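The computation in (4) can be sketched for the finite group of cyclic shifts, where the Haar integral becomes an average over all shifts. The choice of group, the use of `np.tanh` as $\eta$, the dimension and the helper names are assumptions made for illustration.

```python
import numpy as np

def template_movie(t):
    """All transformed templates g t, g in Z_n (rows are cyclic shifts of t)."""
    return np.stack([np.roll(t, s) for s in range(len(t))])

def invariant_feature(I, movie, eta=np.tanh):
    """mu(I) = average over g of eta(<I, g t>), cf. (4) with normalized Haar measure."""
    return float(np.mean(eta(movie @ I)))

rng = np.random.default_rng(0)
n = 16
t = rng.standard_normal(n)
t /= np.linalg.norm(t)                  # unit-norm template
I = rng.standard_normal(n)
movie = template_movie(t)               # stored once, reused for every new image

mu_I = invariant_feature(I, movie)
mu_gI = invariant_feature(np.roll(I, 5), movie)   # feature of a transformed image

# Unitarity: <g I, t> = <I, g^{-1} t> for the cyclic shift g by 5 positions
lhs = float(np.dot(np.roll(I, 5), t))
rhs = float(np.dot(I, np.roll(t, -5)))
```

Only the stored movie of the template is needed: the new image `I` is never transformed when computing `invariant_feature`, and `mu_I` agrees with `mu_gI` up to rounding.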

While group averaging provides a natural way to tackle the problem of
invariant representation, it is not clear how a family of invariant measurements
can be ensured to be selective. Indeed, in the case of compact groups selectivity
can be provably characterized using a probabilistic argument summarized in the
following three steps:

1. A unique probability distribution can be naturally associated to each orbit. 

2. Each such probability distribution can be characterized in terms of
one-dimensional projections.

3. One dimensional probability distributions are easy to characterize, e.g. in
terms of their cumulative distribution or their moments.

We note in passing that the above development, which we describe in detail
next, naturally provides as a byproduct indications on how the non linearity
in (3) needs to be chosen and thus gives insights on the nature of the pooling
operation.

4.2 A Probabilistic Approach to Selectivity 

Let $\mathcal{I} = \mathbb{R}^d$, and $\mathcal{P}(\mathcal{I})$ the space of
probability measures on $\mathcal{I}$. Recall that for any compact group, the
Haar measure is finite, so that, if appropriately normalized, it corresponds to
a probability measure.

Assumption 1. In the following we assume $\mathcal{G}$ to be Abelian and compact
and the corresponding Haar measure to be normalized.

The first step in our reasoning is the following definition.

Definition 4 (Representation via Orbit Probability). For all
$I \in \mathcal{I}$, define the random variable

$$Z_I : (\mathcal{G}, dg) \to \mathcal{I}, \quad Z_I(g) = gI, \quad \forall g \in \mathcal{G},$$

with law

$$\rho_I(A) = \int_{Z_I^{-1}(A)} dg,$$

for all measurable sets $A \subset \mathcal{I}$. Let

$$P : \mathcal{I} \to \mathcal{P}(\mathcal{I}), \quad P(I) = \rho_I, \quad \forall I \in \mathcal{I}.$$

The map $P$ associates to each point a corresponding probability distribution.
From the above definition we see that we are essentially viewing an orbit as a
distribution of points, and mapping each point to one such distribution. Then
we have the following result.

Theorem 2. For all $I, I' \in \mathcal{I}$,

$$I \sim I' \iff P(I) = P(I'). \qquad (5)$$

Proof. We first prove that $I \sim I' \Rightarrow \rho_I = \rho_{I'}$. Recall
that, if $C_c(\mathcal{I})$ is the set of continuous functions on $\mathcal{I}$
with compact support, $\rho_I$ can be alternatively defined as the unique
probability distribution such that

$$\int f(z)\, d\rho_I(z) = \int_{\mathcal{G}} f(Z_I(g))\, dg, \quad \forall f \in C_c(\mathcal{I}). \qquad (6)$$

Therefore $\rho_I = \rho_{I'}$ if and only if for any $f \in C_c(\mathcal{I})$
we have $\int_{\mathcal{G}} f(Z_I(g))\, dg = \int_{\mathcal{G}} f(Z_{I'}(g))\, dg$,
which follows immediately by a change of variable and invariance of the Haar
measure: writing $I' = \tilde{g} I$ for some $\tilde{g} \in \mathcal{G}$,

$$\int_{\mathcal{G}} f(Z_{I'}(g))\, dg = \int_{\mathcal{G}} f(gI')\, dg = \int_{\mathcal{G}} f(g\tilde{g}I)\, dg = \int_{\mathcal{G}} f(gI)\, dg = \int_{\mathcal{G}} f(Z_I(g))\, dg.$$

To prove that $\rho_I = \rho_{I'} \Rightarrow I \sim I'$, note that
$\rho_I(A) - \rho_{I'}(A) = 0$ for all measurable sets
$A \subseteq \mathcal{I}$ implies in particular that the supports of the
probability distributions of $I$ and $I'$ have non null intersection. Since the
supports of the distributions $\rho_I, \rho_{I'}$ are exactly the orbits
associated to $I, I'$ respectively, and distinct orbits are disjoint, the orbits
coincide, that is $I \sim I'$. □

The above result shows that an invariant representation can be defined
considering the probability distribution naturally associated to each orbit,
however its computational realization would require dealing with
high-dimensional distributions. Indeed, we next show that the above
representation can be further developed to consider only probability
distributions on the real line.

4.2.1 Tomographic Probabilistic Representations 

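We need to introduce some notation and definitions. Let
$\mathcal{T} = \mathcal{S}$, the unit sphere in $\mathcal{I}$, and let
$\mathcal{P}(\mathbb{R})$ denote the set of probability measures on the real
line. For each $t \in \mathcal{T}$, let

$$\psi_t : \mathcal{I} \to \mathbb{R}, \quad \psi_t(I) = \langle I, t \rangle, \quad \forall I \in \mathcal{I}.$$

If $\rho \in \mathcal{P}(\mathcal{I})$, for all $t \in \mathcal{T}$ we denote by
$\rho^t \in \mathcal{P}(\mathbb{R})$ the law given by

$$\rho^t(B) = \int_{\psi_t^{-1}(B)} d\rho,$$

for all measurable sets $B \subset \mathbb{R}$.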

Definition 5 (Radon Embedding). Let
$\mathcal{P}(\mathbb{R})^{\mathcal{T}} = \{ h \mid h : \mathcal{T} \to \mathcal{P}(\mathbb{R}) \}$
and define

$$R : \mathcal{P}(\mathcal{I}) \to \mathcal{P}(\mathbb{R})^{\mathcal{T}}, \quad R(\rho)(t) = \rho^t, \quad \forall t \in \mathcal{T}.$$

The above map associates to each probability distribution a (continuous) family
of probability distributions on the real line defined by one dimensional
projections (tomographies). Interestingly, $R$ can be shown to be a
generalization of the Radon Transform to probability distributions [?]. We are
going to use it to define the following data representation.

Definition 6 (TP Representation). We define the Tomographic Probabilistic
(TP) representation as

$$\Psi : \mathcal{I} \to \mathcal{P}(\mathbb{R})^{\mathcal{T}}, \quad \Psi = R \circ P,$$

with $P$ and $R$ as in Definitions 4 and 5, respectively.

The TP representation is obtained by first mapping each point to the
distribution supported on its orbit and then to a (continuous) family of
corresponding one dimensional distributions. The following result characterizes
the invariance/selectivity property of the TP representation.

Theorem 3. Let $\Psi$ be the TP representation in Definition 6, then for all
$I, I' \in \mathcal{I}$,

$$I \sim I' \iff \Psi(I) = \Psi(I'). \qquad (7)$$

The proof of the above result is obtained combining Theorem 2 with the following
well known result, characterizing probability distributions in terms of their
one dimensional projections.

Theorem 4 (Cramér-Wold [8]). For any $\rho, \gamma \in \mathcal{P}(\mathcal{I})$,
it holds

$$\rho = \gamma \iff \rho^t = \gamma^t, \quad \forall t \in \mathcal{S}. \qquad (8)$$

Through the TP representation, the problem of finding invariant/selective
representations reduces to the study of one dimensional distributions, as we
discuss next.

4.2.2 CDF Representation 

A natural way to describe a one-dimensional probability distribution is to
consider the associated cumulative distribution function (CDF). Recall that if
$\xi : (\Omega, p) \to \mathbb{R}$ is a random variable with law
$q \in \mathcal{P}(\mathbb{R})$, then the associated CDF is given by

$$f_q(b) = q((-\infty, b]) = \int dp(a)\, H(b - \xi(a)), \quad b \in \mathbb{R}, \qquad (9)$$

where $H$ is the Heaviside step function. Also recall that the CDF uniquely
defines a probability distribution since, by the Fundamental Theorem of
Calculus, we have

$$\frac{d}{db} f_q(b) = p(b),$$

where here $p(b)$ denotes the density of $q$ at $b$, when it exists.

We consider the following map. 

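Definition 7 (CDF Vector Map). Let
$\mathcal{F}(\mathbb{R}) = \{ h \mid h : \mathbb{R} \to [0, 1] \}$, and
$\mathcal{F}(\mathbb{R})^{\mathcal{T}} = \{ h \mid h : \mathcal{T} \to \mathcal{F}(\mathbb{R}) \}$.
Define

$$F : \mathcal{P}(\mathbb{R})^{\mathcal{T}} \to \mathcal{F}(\mathbb{R})^{\mathcal{T}}, \quad F(\gamma)(t) = f_{\gamma^t},$$

for $\gamma \in \mathcal{P}(\mathbb{R})^{\mathcal{T}}$ and where we let
$\gamma^t = \gamma(t)$ for all $t \in \mathcal{T}$.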

The above map associates to a family of probability distributions on the real 
line their corresponding CDFs. We can then define the following representation. 

Definition 8 (CDF Representation). Let

$$\mu : \mathcal{I} \to \mathcal{F}(\mathbb{R})^{\mathcal{T}}, \quad \mu = F \circ R \circ P,$$

with $F$, $P$ and $R$ as in Definitions 7, 4 and 5, respectively.

Then, the following result holds.

Theorem 5. For all $I \in \mathcal{I}$ and $t \in \mathcal{T}$,

$$\mu^t(I)(b) = \int dg\, \eta_b(\langle I, gt \rangle), \quad b \in \mathbb{R}, \qquad (10)$$

where we let $\mu^t(I) = \mu(I)(t)$ and, for all $b \in \mathbb{R}$,
$\eta_b : \mathbb{R} \to \mathbb{R}$ is given by $\eta_b(a) = H(b - a)$,
$a \in \mathbb{R}$. Moreover, for all $I, I' \in \mathcal{I}$,

$$I \sim I' \iff \mu(I) = \mu(I').$$

Proof. The proof follows noting that $\mu$ is the composition of the one to one
maps $F, R$ and a map $P$ that is one to one w.r.t. the equivalence classes
induced by the group of transformations $\mathcal{G}$. Therefore $\mu$ is one to
one w.r.t. the equivalence classes, i.e.
$I \sim I' \iff \mu(I) = \mu(I')$. □

We note that, from a direct comparison, one can see that (10) is of the
form (4). Different measurements correspond to different choices of the
threshold $b$.
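A small worked instance of (10) may help fix ideas. The sketch below assumes the group of cyclic shifts on $\mathbb{R}^4$, a single template and three thresholds, all chosen by hand; the helper names `projections` and `cdf_feature` are hypothetical.

```python
import numpy as np

def projections(I, t):
    """<I, g t> for every cyclic shift g, with t rescaled to unit norm."""
    movie = np.stack([np.roll(t, s) for s in range(len(t))])
    return (movie @ I) / np.linalg.norm(t)

def cdf_feature(I, t, thresholds):
    """mu^t(I)(b) = average over g of H(b - <I, g t>), one value per threshold b."""
    proj = projections(I, t)
    return np.array([np.mean(proj <= b) for b in thresholds])

t = np.array([1.0, 2.0, 0.0, 0.0])
thresholds = np.array([0.2, 0.6, 1.0])
I = np.array([1.0, 0.0, 0.0, 0.0])
I_shifted = np.roll(I, 2)                 # same orbit as I
I_other = np.array([1.0, 1.0, 0.0, 0.0])  # different orbit

mu_I = cdf_feature(I, t, thresholds)              # [0.5, 0.75, 1.0]
mu_shifted = cdf_feature(I_shifted, t, thresholds)
mu_other = cdf_feature(I_other, t, thresholds)    # [0.25, 0.5, 0.75]
```

The representation of `I_shifted` coincides with that of `I`, while `I_other`, which lies on a different orbit, is mapped to different CDF values. With a single template and a few thresholds this is only a finite sample of the full representation $\mu$.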


Remark 4 (Pooling Functions: from CDF to Moments and Beyond). The above
reasoning suggests that a principled choice for the non linearity in (3) is a
step function, which in practice could be replaced by a smooth approximation
such as a sigmoidal function. Interestingly, other choices of non linearities
could be considered. For example, considering different powers would yield
information on the moments of the distributions (more general non linear
functions than powers would yield generalized moments). This latter point of
view is discussed in some detail in the Appendix.
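As a sketch of the moment-based alternative mentioned in Remark 4, the following code pools powers of the projections instead of step functions. The cyclic shift group, the hand-picked template and the function name `moment_features` are assumptions for illustration.

```python
import numpy as np

def moment_features(I, t, orders=(1, 2, 3)):
    """Group averages of eta(a) = a**m over cyclic shifts, for each order m."""
    movie = np.stack([np.roll(t, s) for s in range(len(t))])
    proj = movie @ I                      # <I, g t> for every shift g
    return np.array([np.mean(proj ** m) for m in orders])

t = np.array([1.0, 2.0, 0.0, 0.0]) / np.sqrt(5.0)   # unit-norm template
I = np.array([1.0, 0.0, 0.0, 0.0])
I_other = np.array([1.0, 1.0, 0.0, 0.0])            # different orbit

m_I = moment_features(I, t)
m_shifted = moment_features(np.roll(I, 1), t)       # same orbit as I
m_other = moment_features(I_other, t)
```

The moments of `I` and of its shifted copy agree, while those of `I_other` differ; for instance the second moment is 0.25 for `I` and 0.7 for `I_other`.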


4.3 Templates Sampling and Metric Embeddings

We next discuss what happens if only a finite number of (possibly random)
templates are available. In this case, while invariance can be ensured, in general
we cannot expect selectivity to be preserved. However, it is possible to show that
the representation is almost selective (see below) if a sufficiently large
number of templates is available.

Towards this end we introduce a metric structure on the representation
space. Recall that if $\rho, \rho' \in \mathcal{P}(\mathbb{R})$ are two
probability distributions on the real line and $f_\rho, f_{\rho'}$ their
cumulative distribution functions, then the uniform Kolmogorov-Smirnov (KS)
metric is induced by the uniform norm of the cumulative distributions, that is

$$d_\infty(f_\rho, f_{\rho'}) = \sup_{s \in \mathbb{R}} |f_\rho(s) - f_{\rho'}(s)|,$$

and takes values in $[0, 1]$. Then, if $\mu$ is the representation in (10), we
can consider the metric

$$d(I, I') = \int du(t)\, d_\infty(\mu^t(I), \mu^t(I')), \qquad (11)$$

where $u$ is the (normalized) uniform measure on the sphere $\mathcal{S}$. We
note that Theorems 4 and 5 ensure that (11) is a well defined metric on the
quotient space induced by the group transformations, in particular

$$d(I, I') = 0 \iff I \sim I'.$$

If we consider the case in which only a finite set
$\mathcal{T}_k = \{t_1, \dots, t_k\} \subset \mathcal{S}$ of $k$ templates is
available, each point is mapped to a finite sequence of probability
distributions or CDFs and (11) is replaced by

$$\hat{d}(I, I') = \frac{1}{k} \sum_{i=1}^{k} d_\infty(\mu^{t_i}(I), \mu^{t_i}(I')). \qquad (12)$$

Clearly, in this case we cannot expect to be able to discriminate every pair of
points, however we have the following result.

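Theorem 6. Consider $n$ images $\mathcal{I}_n$ in $\mathcal{I}$. Let
$k \ge \frac{c}{\epsilon^2} \log \frac{n}{\delta}$, where $c$ is a constant.
Then with probability $1 - \delta^2$,

$$|d(I, I') - \hat{d}(I, I')| \le \epsilon, \qquad (13)$$

for all $I, I' \in \mathcal{I}_n$.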

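Proof. The proof follows from a direct application of Hoeffding's inequality and
a union bound. Fix $I, I' \in \mathcal{I}_n$. Define the real random variable
$Z : \mathcal{S} \to [0, 1]$,

$$Z(t_i) = d_\infty(\mu^{t_i}(I), \mu^{t_i}(I')), \quad i = 1, \dots, k.$$

From the definitions it follows that $\|Z\| \le 1$ and
$\mathbb{E}(Z) = d(I, I')$. Then, Hoeffding's inequality implies

$$|d(I, I') - \hat{d}(I, I')| = \left| \mathbb{E}(Z) - \frac{1}{k} \sum_{i=1}^{k} Z(t_i) \right| \ge \epsilon,$$

with probability at most $2e^{-2\epsilon^2 k}$. A union bound implies that the
result fails for some pair in $\mathcal{I}_n$ with probability at most
$n^2\, 2e^{-2\epsilon^2 k}$. The proof is concluded setting this probability to
$\delta^2$ and taking $k \ge \frac{c}{\epsilon^2} \log \frac{n}{\delta}$ for a
suitable constant $c$. □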
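The empirical distance (12) and the number of templates suggested by the proof can be sketched as follows. The setting is an assumption made for illustration: cyclic shifts as the group, random integer-valued templates rescaled to unit norm, and the standard Hoeffding constant for variables in $[0, 1]$; the helper names are hypothetical.

```python
import numpy as np

def projections(I, t):
    """<I, g t> over all cyclic shifts g, with t rescaled to unit norm."""
    movie = np.stack([np.roll(t, s) for s in range(len(t))])
    return (movie @ I) / np.linalg.norm(t)

def ks_distance(x, y):
    """sup_s |F_x(s) - F_y(s)| between the empirical CDFs of samples x and y."""
    grid = np.concatenate([x, y])          # the sup is attained at a jump point
    Fx = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    Fy = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return float(np.max(np.abs(Fx - Fy)))

def d_hat(I, J, templates):
    """Empirical distance: average KS distance over the available templates."""
    return float(np.mean([ks_distance(projections(I, t), projections(J, t))
                          for t in templates]))

def templates_needed(eps, delta, n):
    """Smallest k such that 2 * n**2 * exp(-2 * eps**2 * k) <= delta**2."""
    return int(np.ceil(np.log(2 * n ** 2 / delta ** 2) / (2 * eps ** 2)))

rng = np.random.default_rng(0)
dim, k = 8, 20
# Integer-valued data keep the inner products exact in floating point.
templates = rng.integers(-3, 4, size=(k, dim)).astype(float)
templates = templates[np.linalg.norm(templates, axis=1) > 0]
I = rng.integers(0, 5, size=dim).astype(float)
J = rng.integers(0, 5, size=dim).astype(float)

same_orbit = d_hat(I, np.roll(I, 3), templates)   # 0.0: invariance
other = d_hat(I, J, templates)
```

Under these assumptions `same_orbit` is exactly zero, and `templates_needed(0.1, 0.1, 100)` returns 726, i.e. the number of templates grows only logarithmically in the number of images.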

We note that, while we considered the KS metric for convenience, other
metrics over probability distributions can be considered. Also, we note that a
natural further question is how discretization/sampling of the group affects the
representation. The above reasoning could be extended to yield results in this
latter case. Finally, we note that, when compared to classical results on distance
preserving embedding, such as the Johnson-Lindenstrauss Lemma [18], Theorem 6
only ensures distance preservation up to a given accuracy which increases with
a larger number of projections. This is hardly surprising, since the problem of
finding suitable embeddings for probability spaces is known to be considerably
harder than the analogous problem for vector spaces [2]. The question of how
to devise strategies to define distance preserving embeddings is an interesting open
problem.
5 Locally Invariant and Covariant Representations


We consider the case where a representation is given by a collection of "local"
group averages, and refer to this situation as the partially observable group
(POG) case. Roughly speaking, the idea is that this kind of measurement
can be invariant to sufficiently small transformations, i.e. be locally invariant.
Moreover, representations given by collections of POG averages can be shown
to be covariant (see section 5.2 for a definition).


5.1 Partially Observable Group Averages 

For a subset $\mathcal{G}_0 \subset \mathcal{G}$ consider a POG measurement of the form

$$\psi(I) = \int_{\mathcal{G}_0} dg\, \eta(\langle I, gt \rangle). \qquad (14)$$

The above quantity can be interpreted as the "response" of a cell that can
perceive visual stimuli within a "window" (receptive field) of size $\mathcal{G}_0$. A POG
measurement corresponds to a local group average restricted to a subset of
transformations $\mathcal{G}_0$. Clearly, such a measurement will not in general be invariant.
Consider a POG measurement on a transformed point

$$\int_{\mathcal{G}_0} dg\, \eta(\langle \bar g I, gt \rangle) = \int_{\mathcal{G}_0} dg\, \eta(\langle I, \bar g^{-1} gt \rangle) = \int_{\bar g \mathcal{G}_0} dg\, \eta(\langle I, gt \rangle).$$

If we compare the POG measurements on the same point with and without a
transformation, we have

$$\Big| \int_{\mathcal{G}_0} dg\, \eta(\langle I, gt \rangle) - \int_{\bar g \mathcal{G}_0} dg\, \eta(\langle I, gt \rangle) \Big|. \qquad (15)$$

While there are several situations in which the above difference can be zero,
the intuition from the vision interpretation is that the same response should be
obtained if a sufficiently small object does not move (transform) too much with
respect to the receptive field size. This latter situation can be described by the
assumption that the function

$$h(g) = \eta(\langle I, gt \rangle)$$

is zero outside of the intersection $\bar g \mathcal{G}_0 \cap \mathcal{G}_0$. Indeed, for all $\bar g \in \mathcal{G}$ satisfying
this latter assumption, the difference in (15) would clearly be zero. The above
reasoning results in the following theorem.


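Theorem 7. Given $I \in \mathcal{I}$ and $t \in \mathcal{T}$, assume that there exists a set $\bar{\mathcal{G}} \subset \mathcal{G}$ such
that, for all $\bar g \in \bar{\mathcal{G}}$,

$$\eta(\langle I, gt \rangle) = 0 \quad \forall g \notin \bar g \mathcal{G}_0 \cap \mathcal{G}_0. \qquad (16)$$

Then for $\bar g \in \bar{\mathcal{G}}$

$$\psi(I) = \psi(\bar g I),$$

with $\psi$ as in (14).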
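A discrete toy version of Theorem 7 may help fix ideas. The sketch below is only an illustration under stated assumptions: the group is taken to be cyclic translations of a vector of length 16, the integral in (14) is replaced by a sum over a window of shifts, the nonlinearity is $\eta(x) = x^2$, and the names `pog_measurement`, `image`, `template` and `window` are hypothetical.

```python
def shift(x, s):
    """Cyclic translation by s: the group here is the integers modulo len(x)."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def eta(x):
    """Pooling nonlinearity; x**2 is zero only at zero."""
    return x * x


def pog_measurement(image, template, window):
    """Discrete analogue of (14): sum of eta(<image, g template>) over g in the window."""
    total = 0
    for g in window:
        gt = shift(template, g)
        total += eta(sum(a * b for a, b in zip(image, gt)))
    return total


n = 16
image = [0] * n
image[4:7] = [1, 2, 1]             # small "object" supported on positions 4..6
template = [1, 1] + [0] * (n - 2)  # localized template
window = range(1, 11)              # observable subset G_0 of translations

psi = pog_measurement(image, template, window)
psi_small = pog_measurement(shift(image, 2), template, window)  # response stays inside G_0
psi_large = pog_measurement(shift(image, 6), template, window)  # response partly leaves G_0
```

Here the projections are nonzero only for four consecutive shifts, so a translation by 2 keeps them inside the window and the measurement is unchanged, while a translation by 6 pushes part of the response outside the window and the measurement changes.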
Figure 1: A sufficient condition for invariance for locally compact groups: if
$\langle I, gt \rangle = 0$ for all $g \in \bar g \mathcal{G}_0 \Delta \mathcal{G}_0$, the integrals of $\eta_b(\langle I, gt \rangle)$ over $\mathcal{G}_0$ and $\bar g \mathcal{G}_0$ will be
equal.


We add a few comments. First, we note that condition (16) can be weakened
requiring only $\eta(\langle I, gt \rangle) = 0$ for all $g \in \bar g \mathcal{G}_0 \Delta \mathcal{G}_0$, where we denote by $\Delta$ the
symmetric difference of two sets ($A \Delta B = (A \cup B) \setminus (A \cap B)$ with $A, B$ sets).
Second, we note that if the nonlinearity $\eta$ is zero only in zero, then we can
rewrite condition (16) as

$$\langle I, gt \rangle = 0, \quad \forall g \in \bar g \mathcal{G}_0 \Delta \mathcal{G}_0.$$

Finally, we note that the latter expression has a simple interpretation in the case
of the translation group. In fact, we can interpret (16) as a spatial localization
condition on the image $I$ and the template $t$ (assumed to be positive valued
functions), see Figure 1. We conclude with the following remarks.

Remark 5 (Localization Condition and V1). Regarding the localization condition
discussed above, as we comment elsewhere, the fact that a template
needs to be localized could have implications from a biological modeling
standpoint. More precisely, it could provide a theoretical foundation for the
Gabor-like shape of the responses observed in V1 cells in the visual cortex.


Remark 6 (More on the Localization Condition). From a more mathematical
point of view, an interesting question is under which conditions
the localization condition (16) is also necessary rather than only sufficient.
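5.2 POG Representation

For all $\bar g \in \mathcal{G}$, let $\bar g \mathcal{G}_0 = \{g \in \mathcal{G} \mid g = \bar g g', \; g' \in \mathcal{G}_0\}$, the collection of "local"
subsets of the group obtained from the subset $\mathcal{G}_0$. Moreover, let

$$V = \int_{\mathcal{G}_0} dg.$$

Clearly, by the invariance of the measure, we have $\int_{\bar g \mathcal{G}_0} dg = V$, for all $\bar g \in \mathcal{G}$.
Then, for all $I \in \mathcal{I}$, $\bar g \in \mathcal{G}$, define the random variables

$$Z_{I, \bar g} : \bar g \mathcal{G}_0 \to \mathcal{I}, \qquad Z_{I, \bar g}(g) = gI, \quad g \in \bar g \mathcal{G}_0, \qquad (17)$$

with laws

$$\rho_{I, \bar g}(A) = \frac{1}{V} \int_{Z_{I, \bar g}^{-1}(A)} dg$$

for all measurable sets $A \subset \mathcal{I}$. For each $I \in \mathcal{I}$, $\bar g \in \mathcal{G}$, the measure $\rho_{I, \bar g}$
corresponds to the distribution on the fraction of the orbit corresponding to the
observable group subset $\bar g \mathcal{G}_0$. Then we can represent each point with a collection
of POG distributions.

Definition 9 (Representation via POG Probabilities). Let $\mathcal{P}(\mathcal{I})^{\mathcal{G}} = \{h \mid h : \mathcal{G} \to \mathcal{P}(\mathcal{I})\}$ and define

$$P : \mathcal{I} \to \mathcal{P}(\mathcal{I})^{\mathcal{G}}, \qquad P(I)(g) = \rho_{I, g} \quad \forall I \in \mathcal{I}, \; g \in \mathcal{G}.$$

Each point is mapped in the collection of distributions obtained considering all
possible fractions of the orbit corresponding to $\bar g \mathcal{G}_0$, $\bar g \in \mathcal{G}$. Note that the action
of an element $\tilde g \in \mathcal{G}$ of the group on the POG probability representation is given by

$$\tilde g P(I)(g) = P(I)(g \tilde g)$$

for all $g \in \mathcal{G}$. The following result holds.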

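Theorem 8. Let $P$ be as in Definition 9. Then for all $I, I' \in \mathcal{I}$,

$$I \sim I' \;\Rightarrow\; \exists\, \tilde g \in \mathcal{G} \text{ such that } P(I') = \tilde g P(I). \qquad (18)$$

Equivalently, for all $I, I' \in \mathcal{I}$, if

$$I' = \tilde g I$$

then

$$P(I')(g) = P(I)(g \tilde g), \quad \forall g \in \mathcal{G}, \qquad (19)$$

i.e. $P$ is covariant.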

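Proof. The proof follows noting that $\rho_{I', \bar g} = \rho_{I, \bar g \tilde g}$ holds since, using the same
characterization of $\rho$ as in the previous sections, we have that for any $f \in C_c(\mathcal{I})$

$$\int_{\bar g \mathcal{G}_0} f(Z_{I', \bar g}(g))\, dg = \int_{\bar g \mathcal{G}_0} f(g I')\, dg = \int_{\bar g \mathcal{G}_0} f(g \tilde g I)\, dg = \int_{\bar g \mathcal{G}_0 \tilde g} f(g I)\, dg$$

where we used the invariance of the measure. □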
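The covariance property in Theorem 8 can be checked numerically on a finite example. The following is a sketch under assumptions that are not part of the paper: the group is the (commutative) cyclic translations, each POG distribution is represented by the multiset of orbit points it charges, and `pog_distribution` and `pog_representation` are hypothetical helper names.

```python
from collections import Counter


def shift(x, s):
    """Cyclic translation of a tuple by s positions."""
    n = len(x)
    return tuple(x[(i - s) % n] for i in range(n))


def pog_distribution(image, center, window):
    """Empirical law of g*image for g in the translated window center + G_0."""
    n = len(image)
    return Counter(shift(image, (center + g) % n) for g in window)


def pog_representation(image, window):
    """One orbit-fragment distribution per group element."""
    return [pog_distribution(image, center, window) for center in range(len(image))]


image = (5, 0, 1, 0, 0, 2, 0, 0)
window = range(3)  # observable subset G_0
s = 2              # transformation applied to the image
n = len(image)

rep = pog_representation(image, window)
rep_shifted = pog_representation(shift(image, s), window)

# Covariance: transforming the input only re-indexes the collection of distributions.
covariant = all(rep_shifted[g] == rep[(g + s) % n] for g in range(n))
# The collection itself is not invariant.
invariant = rep_shifted == rep
```

The example shows the distinction drawn in the text: the representation of the translated vector is a re-indexed copy of the original one, so it is covariant without being invariant.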
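Following the reasoning in the previous sections and recalling the corresponding
definition given there, we consider the mapping given by one dimensional
projections (tomographies) and corresponding representations.

Definition 10 (TP-POG Representation). Let $\mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}} = \{h \mid h : \mathcal{G} \times \mathcal{T} \to \mathcal{P}(\mathbb{R})\}$ and define

$$R : \mathcal{P}(\mathcal{I})^{\mathcal{G}} \to \mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}, \qquad R(h)(g, t) = R(h(g))(t) = h^t(g),$$

for all $h \in \mathcal{P}(\mathcal{I})^{\mathcal{G}}$, $g \in \mathcal{G}$, $t \in \mathcal{T}$. Moreover, we define the Tomographic
Probabilistic POG representation as

$$\bar\Psi : \mathcal{I} \to \mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}, \qquad \bar\Psi = R \circ P,$$

with $P$ as in Definition 9.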

We have the following result:

Theorem 9. The representation $\bar\Psi$ defined in Definition 10 is covariant, i.e.
$\bar\Psi(\tilde g I)(g) = \bar\Psi(I)(g \tilde g)$.

Proof. The map $\bar\Psi = R \circ P$ is covariant if both $R$ and $P$ are covariant. The
map $P$ was proven to be covariant in Theorem 8. We then need to prove the
covariance of $R$, i.e. $\tilde g R(h)(g, t) = R(h)(g \tilde g, t)$ for all $h \in \mathcal{P}(\mathcal{I})^{\mathcal{G}}$. This follows
from

$$R(\tilde g h)(g, t) = R(\tilde g h(g))(t) = R(h(g \tilde g))(t) = R(h)(g \tilde g, t).$$

□

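The TP-POG representation is obtained by first mapping each point $I$ in the
family of distributions $\rho_{I, g}$, $g \in \mathcal{G}$, supported on the orbit fragments corresponding
to POG and then in a (continuous) family of corresponding one dimensional
distributions $\rho^t_{I, g}$, $g \in \mathcal{G}$, $t \in \mathcal{T}$. Finally, we can consider the representation
obtained representing each distribution via the corresponding CDF.

Definition 11 (CDF-POG Representation). Let $\mathcal{F}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}} = \{h \mid h : \mathcal{G} \times \mathcal{T} \to \mathcal{F}(\mathbb{R})\}$ and define

$$F : \mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}} \to \mathcal{F}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}, \qquad F(h)(g, t) = f_{h(g, t)},$$

for all $h \in \mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}$ and $g \in \mathcal{G}$, $t \in \mathcal{T}$, where $f_{h(g,t)}$ denotes the CDF of
the distribution $h(g, t)$. Moreover, define the CDF-POG representation as

$$\bar\mu : \mathcal{I} \to \mathcal{F}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}, \qquad \bar\mu = F \circ R \circ P,$$

with $P$ as in Definition 9 and $F$ as defined above.

It is easy to show that

$$\bar\mu^{\bar g, t}(I)(b) = \int_{\bar g \mathcal{G}_0} \eta_b(\langle I, gt \rangle)\, dg, \qquad (20)$$

where we let $\bar\mu^{\bar g, t}(I) = \bar\mu(I)(\bar g, t)$.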
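A finite analogue of (20) can be written in a few lines. The sketch below is an illustration only and rests on assumptions not made in the paper: cyclic translations as the group, a step nonlinearity $\eta_b(x) = 1$ if $x \le b$ and $0$ otherwise, and a normalization by the window size so that the values lie in [0,1]. The name `local_cdf` is hypothetical.

```python
def shift(x, s):
    """Cyclic translation of a list by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def local_cdf(image, template, center, window, b):
    """Fraction of g in center + G_0 with <image, g template> <= b."""
    n = len(image)
    hits = 0
    for g in window:
        gt = shift(template, (center + g) % n)
        if sum(x * y for x, y in zip(image, gt)) <= b:
            hits += 1
    return hits / len(window)


image = [0, 3, 1, 0, 0, 2, 0, 0]
template = [1, 2, 0, 0, 0, 0, 0, 0]
window = range(3)
thresholds = [0, 2, 5, 8]
n, s = len(image), 3

table = {(c, b): local_cdf(image, template, c, window, b)
         for c in range(n) for b in thresholds}

# With this convention for the action, translating the image by s moves the
# window index from c to c - s: the local CDFs are covariant, not invariant.
covariant = all(
    local_cdf(shift(image, s), template, c, window, b) == table[((c - s) % n, b)]
    for c in range(n) for b in thresholds)
```

Whether the window index moves by $+s$ or $-s$ depends only on how the action on templates is written; the point of the example is that the transformed input yields the same family of local CDFs, re-indexed by the group.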
6 Further Developments: Hierarchical Representation


In this section we discuss some further developments of the framework presented
in the previous section. In particular, we sketch how multi-layer (deep)
representations can be obtained abstracting and iterating the basic ideas introduced
before.

Hierarchical representations, based on multiple layers of computations, have
naturally arisen from models of information processing in the brain.
They have also been critically important in recent machine learning successes in
a variety of engineering applications. In this section we address the
question of how to generalize the framework previously introduced to consider
multi-layer representations.

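Recall that the basic idea for building invariant/selective representations is
to consider local (or global) measurements of the form

$$\int_{\mathcal{G}_0} \eta(\langle I, gt \rangle)\, dg, \qquad (21)$$

with $\mathcal{G}_0 \subseteq \mathcal{G}$. A main difficulty in iterating this idea is that, following the
development in previous sections, the representation culminating in (20), induced by
a collection of (local) group averages, maps the data space $\mathcal{I}$ in the space
$\mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}$. The latter space lacks an inner product as well as the natural linear
structure needed to define the measurements in (21). One possibility to overcome
this problem is to consider an embedding in a suitable Hilbert space. The first step
in this direction is to consider an embedding of the probability space $\mathcal{P}(\mathbb{R})$ in a (real
separable) Hilbert space $\mathcal{H}$. Interestingly, this can be achieved considering a
variety of reproducing kernels over probability distributions, as we describe in
Appendix B. Here we note that if $\Phi : \mathcal{P}(\mathbb{R}) \to \mathcal{H}$ is one such embedding, then
we could consider a corresponding embedding of $\mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}$ in the space

$$L^2(\mathcal{G} \times \mathcal{T}, \mathcal{H}) = \Big\{ h : \mathcal{G} \times \mathcal{T} \to \mathcal{H} \;\Big|\; \int \|h(g, t)\|_{\mathcal{H}}^2 \, dg\, du(t) < \infty \Big\}$$

where $\|\cdot\|_{\mathcal{H}}$ is the norm induced by the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ in $\mathcal{H}$ and $u$ is the
uniform measure on the sphere $\mathbb{S} \subset \mathcal{I}$. The space $L^2(\mathcal{G} \times \mathcal{T}, \mathcal{H})$ is endowed with
the inner product

$$\langle h, h' \rangle_2 = \int \langle h(g, t), h'(g, t) \rangle_{\mathcal{H}} \, dg\, du(t),$$

for all $h, h' \in L^2(\mathcal{G} \times \mathcal{T}, \mathcal{H})$, so that the corresponding norm is exactly

$$\|h\|_2^2 = \int \|h(g, t)\|_{\mathcal{H}}^2 \, dg\, du(t).$$

The embedding of $\mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}$ in $L^2(\mathcal{G} \times \mathcal{T}, \mathcal{H})$ is simply given by

$$J_\Phi : \mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}} \to L^2(\mathcal{G} \times \mathcal{T}, \mathcal{H}), \qquad J_\Phi(\rho)(g, t) = \Phi(\rho(g, t)),$$

for all $\rho \in \mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}$. Provided with the above notation we have the following result.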
Theorem 10. The representation defined by

$$Q : \mathcal{I} \to L^2(\mathcal{G} \times \mathcal{T}, \mathcal{H}), \qquad Q = J_\Phi \circ \bar\Psi,$$

with $\bar\Psi$ as in Definition 10, is covariant, in the sense that

$$Q(gI) = g\,Q(I) \qquad (22)$$

for all $I \in \mathcal{I}$, $g \in \mathcal{G}$.

Proof. The proof follows checking that by definition both $R$ and $J_\Phi$ are covariant
and using Theorem 8. The fact that $R$ is covariant was proven in Th. 9. The
covariance of $J_\Phi$, i.e. $\tilde g J_\Phi(h)(g, t) = J_\Phi(h)(g \tilde g, t)$ for all $h \in \mathcal{P}(\mathbb{R})^{\mathcal{G} \times \mathcal{T}}$, follows
from

$$J_\Phi(\tilde g h)(g, t) = \Phi(\tilde g h(g, t)) = \Phi(h(g \tilde g, t)) = J_\Phi(h)(g \tilde g, t).$$

Now since $P$ was already proven covariant in Th. 8 we have that, $Q = J_\Phi \circ R \circ P$
being a composition of covariant representations, $Q$ is covariant, i.e. $\tilde g Q(I) =
Q(\tilde g I)$. □

Using the above definitions a second layer invariant measurement can be
defined considering

$$v(I) = \int_{\mathcal{G}_0} \eta(\langle Q(I), g\tau \rangle_2)\, dg \qquad (23)$$

where $\tau \in L^2(\mathcal{G} \times \mathcal{T}, \mathcal{H})$ has unit norm.

We add several comments. First, following the analysis in the previous
sections, Equation (23) can be used to define invariant (or locally invariant)
measurements and hence representations defined by collections of measurements.
Second, the construction can be further iterated to consider multi-layer
representations, where at each layer an intermediate representation is obtained
considering "distributions of distributions". Third, considering multiple layers
naturally raises the question of how the number and properties of each layer affect
the properties of the representation. Preliminary answers to these questions are
described in [3]. A full mathematical treatment is beyond the scope
of the current paper, which however provides a formal framework to tackle them
in future work.


7 Discussion 

Motivated by the goal of characterizing good data representations that can be
learned, this paper studies the mathematics of an approach to learning data
representations that are invariant and selective to suitable transformations. While
invariance can be proved rather directly from the invariance of the Haar measure
associated with the group, characterizing selectivity requires a novel probabilistic
argument developed in the previous sections.
Several extensions of the theory are natural and have been sketched with
preliminary results in [3, 23]. The main directions that need a rigorous theory
extending the results of this paper are:

• Hierarchical architectures. We described how the theory can be used to
analyze local invariance properties, in particular for locally compact groups.
We described covariance properties. Covariant layers can integrate
representations that are locally invariant into representations that are more
globally invariant.

• Approximate invariance for transformations that are not groups. The
same basic algorithm analyzed in this paper is used to yield approximate
invariance, provided the templates transform as the image, which requires
the templates to be tuned to specific object classes.

We conclude with a few general remarks connecting our paper with this special
issue on deep learning and especially with an eventual theory of such networks.

Hierarchical architectures of simple and complex units. Feedforward architectures
with n layers, consisting of dot products and nonlinear pooling functions, are
quite general computing devices, basically equivalent to Turing machines running
for n time points (for example, the layers of the HMAX architecture in [25]
can be described as AND operations (dot products) followed by OR operations
(pooling), i.e. as disjunctions of conjunctions). Given a very large set of labeled
examples it is not too surprising that greedy algorithms such as stochastic
gradient descent can find satisfactory parameters in such an architecture, as shown
by the recent successes of Deep Convolutional Networks. Supervised learning
with millions of examples, however, is not, in general, biologically plausible.
Our theory can be seen as proving that a form of unsupervised learning in
convolutional architectures is possible and effective, because it provides invariant
representations with small sample complexity.

Two stages: group and non-group transformations. The core of the theory
applies to compact groups such as rotations of the image in the image plane. Exact
invariance for each module is equivalent to a localization condition which could
be interpreted as a form of sparsity [3]. If the condition is relaxed to hold
approximately it becomes a sparsity condition for the class of images w.r.t. the
dictionary $t^k$ under the group $\mathcal{G}$ when restricted to a subclass of similar
images. This property, which is similar to compressive sensing "incoherence" (but
in a group context), requires that $I$ and $t^k$ have a representation with rather
sharply peaked autocorrelation (and correlation) and guarantees approximate
invariance for transformations which do not have group structure, see [21].

Robustness of pooling. It is interesting that the theory is robust with respect to
the pooling nonlinearity. Indeed, as discussed, a very general class of nonlinearities
will work, see Appendix A. Any nonlinearity will provide invariance, if the
nonlinearity does not change with time and is the same for all the simple cells
pooled by the same complex cell. A sufficient number of different nonlinearities,
each corresponding to a complex cell, can provide selectivity [3].

Biological predictions and biophysics, including dimensionality reduction and
PCAs. There are at least two possible biophysical models for the theory. The
first is the original Hubel and Wiesel model of simple cells feeding into a complex
cell. The theory proposes the "ideal" computation of a CDF, in which case
the nonlinearity at the output of the simple cells is a threshold. A complex
cell, summating the outputs of a set of simple cells, would then represent a bin
of the histogram; a different complex cell in the same position pooling a set of
similar simple cells with a different threshold would represent another bin of the
histogram.

The second biophysical model for the HW module that implements the
computation required by i-theory consists of a single cell where dendritic branches
play the role of simple cells (each branch containing a set of synapses with
weights providing, for instance, Gabor-like tuning of the dendritic branch) with
inputs from the LGN; active properties of the dendritic membrane distal to the
soma provide separate threshold-like nonlinearities for each branch,
while the soma summates the contributions of all the branches. This model
would solve the puzzle that so far there seems to be no morphological difference
between pyramidal cells classified as simple vs complex by physiologists.
Further, if the synapses are Hebbian it can be proved that Hebb's rule, appropriately
modified with a normalization factor, is an online algorithm to compute
the eigenvectors of the input covariance matrix, therefore tuning the dendritic
branch weights to principal components and thus providing an efficient
dimensionality reduction.
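One well known instance of a Hebbian rule with a normalization term is Oja's rule, which is used here as an illustrative stand-in; the text above does not name a specific rule. The sketch below runs it on synthetic two dimensional data whose covariance has one dominant direction. The data, the learning rate and the function name `oja_first_component` are assumptions chosen for the example.

```python
import math
import random


def oja_first_component(samples, lr=0.005, w0=(1.0, 0.0)):
    """Online Hebbian update with Oja's normalization term: w += lr * y * (x - y * w)."""
    w = list(w0)
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))  # unit output (dot product)
        w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


rng = random.Random(0)
u1 = (1 / math.sqrt(2), 1 / math.sqrt(2))   # high-variance direction (std 2.0)
u2 = (-1 / math.sqrt(2), 1 / math.sqrt(2))  # low-variance direction (std 0.5)
samples = []
for _ in range(5000):
    a, b = rng.gauss(0, 2.0), rng.gauss(0, 0.5)
    samples.append((a * u1[0] + b * u2[0], a * u1[1] + b * u2[1]))

w = oja_first_component(samples)
norm = math.sqrt(sum(wi * wi for wi in w))
# Cosine between the learned weights and the leading eigenvector (sign is arbitrary).
alignment = abs(w[0] * u1[0] + w[1] * u1[1]) / norm
```

After a few thousand updates the weight vector is expected to point, up to sign, along the leading eigenvector of the input covariance with norm close to one, which is the sense in which such a rule tunes the weights to a principal component.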

(n → 1). The present phase of Machine Learning is characterized by supervised
learning algorithms relying on large sets of labeled examples (n → ∞). The next
phase is likely to focus on algorithms capable of learning from very few labeled
examples (n → 1), as humans seem able to do. We propose and analyze a possible
approach to this problem based on the unsupervised, automatic learning
of a good representation for supervised learning, characterized by small sample
complexity (n). In this view we take a step towards a major challenge in learning
theory beyond supervised learning, that is the problem of representation
learning, formulated here as the unsupervised learning of invariant representations
that significantly reduce the sample complexity of the supervised learning
stage.
A Representation Via Moments

In Section 4.2.2 we have discussed the derivation of invariant selective
representations considering the CDFs of suitable one dimensional probability
distributions. As we commented in an earlier remark, alternative representations are
possible, for example by considering moments. Here we discuss this point of view in
some more detail.

Recall that if $\xi : (\Omega, p) \to \mathbb{R}$ is a random variable with law $\rho \in \mathcal{P}(\mathbb{R})$, then
the associated moment vector is given by

$$m_\rho^r = E|\xi|^r = \int d\rho\, |\xi|^r, \quad r \in \mathbb{N}. \qquad (24)$$

In this case we have the following definitions and results.

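Definition 12 (Moments Vector Map). Let $\mathcal{M}(\mathbb{R}) = \{h \mid h : \mathbb{N} \to \mathbb{R}\}$, and
$\mathcal{M}(\mathbb{R})^{\mathcal{T}} = \{h \mid h : \mathcal{T} \to \mathcal{M}(\mathbb{R})\}$.

Define

$$M : \mathcal{P}(\mathbb{R})^{\mathcal{T}} \to \mathcal{M}(\mathbb{R})^{\mathcal{T}}, \qquad M(\tilde\rho)(t) = m_{\tilde\rho^t}$$

for $\tilde\rho \in \mathcal{P}(\mathbb{R})^{\mathcal{T}}$ and where we let $\tilde\rho(t) = \tilde\rho^t$, for all $t \in \mathcal{T}$.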

The above mapping associates to each one dimensional distribution the
corresponding vector of moments. Recall that this association uniquely determines
the probability distribution if the so called Carleman's condition is satisfied:

$$\sum_{r=1}^{\infty} m_{2r}^{-\frac{1}{2r}} = +\infty$$

where $m_r$ is the set of moments of the distribution.

We can then define the following representation.

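Definition 13 (Moments Representation). Let

$$\mu : \mathcal{I} \to \mathcal{M}(\mathbb{R})^{\mathcal{T}}, \qquad \mu = M \circ R \circ P,$$

with $M$ as in Definition 12 and $P$ and $R$ as defined in the previous sections.

Then, the following result holds.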

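Theorem 11. For all $I \in \mathcal{I}$ and $t \in \mathcal{T}$

$$\mu^t(I)(r) = \int dg\, |\langle I, gt \rangle|^r, \quad r \in \mathbb{N},$$

where we let $\mu(I)(t) = \mu^t(I)$. Moreover, for all $I, I' \in \mathcal{I}$

$$I \sim I' \iff \mu(I) = \mu(I').$$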

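Proof. $\mu = M \circ R \circ P$ is a composition of a one to one map $R$, a map $P$ that is one
to one w.r.t. the equivalence classes induced by the group of transformations $\mathcal{G}$
and a map $M$ that is one to one since Carleman's condition is satisfied. Indeed,
we have,

$$\sum_{r=1}^{\infty} \Big( \int dg\, |\langle I, gt \rangle|^{2r} \Big)^{-\frac{1}{2r}} \le \sum_{r=1}^{\infty} \Big( \int dg\, |\langle I, gt \rangle| \Big)^{-1} = \sum_{r=1}^{\infty} \frac{1}{C} = +\infty$$

where $C = \int dg\, |\langle I, gt \rangle|$. Therefore $\mu$ is one to one w.r.t. the equivalence
classes, i.e. $I \sim I' \iff \mu(I) = \mu(I')$. □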
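The moment formula in Theorem 11 has a direct finite analogue. The sketch below is a toy illustration under assumptions not made in the paper: cyclic translations as the group, a uniform average in place of the Haar integral, a single template and only the first few moments, so it illustrates invariance and says nothing about the full selectivity statement. The name `moment_signature` is hypothetical.

```python
def shift(x, s):
    """Cyclic translation of a list by s positions."""
    n = len(x)
    return [x[(i - s) % n] for i in range(n)]


def moment_signature(image, template, max_order):
    """Group-averaged moments of |<image, g template>| for r = 1..max_order."""
    n = len(image)
    dots = [abs(sum(a * b for a, b in zip(image, shift(template, g))))
            for g in range(n)]
    return [sum(d ** r for d in dots) / n for r in range(1, max_order + 1)]


# Integer entries keep every sum exact, so equal orbits give identical moments.
image = [3, 1, 0, 0, 2, 0, 0, 1]
other = [1, 1, 1, 1, 0, 0, 0, 4]
template = [1, 0, 2, 0, 0, 0, 0, 0]

sig = moment_signature(image, template, 4)
sig_shifted = moment_signature(shift(image, 5), template, 4)
sig_other = moment_signature(other, template, 4)
```

The moments of a vector and of its translate coincide, while those of the unrelated vector already differ at the first order in this example.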

We add one remark regarding possible developments of the above result. 

Remark 7. Note that the above result essentially depends on the characterization
of the moment problem of probability distributions on the real line. In this
view, it could be further developed to consider for example the truncated case
when only a finite number of moments is considered or the generalized moments
problem, where families of (nonlinear) continuous functions, more general than
powers, are considered.

B Kernels on probability distributions 

To consider multi-layers within the framework proposed in the paper we need
to embed probability spaces in Hilbert spaces. A natural way to do so is by
considering appropriate positive definite (PD) kernels, that is symmetric functions
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$$\sum_{i,j=1}^{n} K(\rho_i, \rho_j)\, \alpha_i \alpha_j \ge 0$$

for all $\rho_1, \dots, \rho_n \in \mathcal{X}$, $\alpha_1, \dots, \alpha_n \in \mathbb{R}$ and where $\mathcal{X}$ is any set, e.g. $\mathcal{X} = \mathbb{R}$ or
$\mathcal{X} = \mathcal{P}(\mathbb{R})$. Indeed, PD kernels are known to define a unique reproducing kernel
Hilbert space (RKHS) $\mathcal{H}_K$ for which they correspond to reproducing kernels, in
the sense that if $\mathcal{H}_K$ is the RKHS defined by $K$, then $K_x = K(x, \cdot) \in \mathcal{H}_K$ for
all $x \in \mathcal{X}$ and

$$\langle f, K_x \rangle_K = f(x), \quad \forall f \in \mathcal{H}_K, \; x \in \mathcal{X}, \qquad (25)$$

where $\langle \cdot, \cdot \rangle_K$ is the inner product in $\mathcal{H}_K$ (see for example [7] for an introduction
to RKHS).

Many examples of kernels on distributions are known and have been studied.
For example, a variety of kernels of the form

$$K(\rho, \rho') = \int d\gamma(x)\, \kappa(p_\rho(x), p_{\rho'}(x))$$

have been discussed, where $p_\rho, p_{\rho'}$ are the densities of the measures $\rho, \rho'$ with respect to a dominating
measure $\gamma$ (which is assumed to exist) and $\kappa : \mathbb{R}_0^+ \times \mathbb{R}_0^+ \to \mathbb{R}$ is a PD kernel.
Recalling that a PD kernel defines a pseudo-metric via the equation

$$d_K(\rho, \rho')^2 = K(\rho, \rho) + K(\rho', \rho') - 2K(\rho, \rho'),$$

it has been shown how different classic metrics on probability distributions
can be recovered by suitable choices of the kernel $\kappa$. For example,

$$\kappa(x, x') = \sqrt{x x'}$$

corresponds to the Hellinger's distance.
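For distributions on a finite set the relation between this kernel and the Hellinger distance can be verified directly. The sketch below assumes the counting measure as dominating measure, so densities are probability vectors and the integral is a sum; the function names are hypothetical, and the squared distance is left unnormalized (some conventions include an extra factor 1/2).

```python
import math


def kernel(p, q, kappa):
    """K(p, q) = sum_x kappa(p(x), q(x)) for distributions on a common finite set."""
    return sum(kappa(a, b) for a, b in zip(p, q))


def kernel_distance_sq(p, q, kappa):
    """Squared pseudo-metric induced by K: K(p,p) + K(q,q) - 2 K(p,q)."""
    return kernel(p, p, kappa) + kernel(q, q, kappa) - 2 * kernel(p, q, kappa)


def hellinger_kappa(a, b):
    """Pointwise kernel kappa(x, x') = sqrt(x x')."""
    return math.sqrt(a * b)


p = [0.5, 0.25, 0.25, 0.0]
q = [0.25, 0.25, 0.25, 0.25]

d2 = kernel_distance_sq(p, q, hellinger_kappa)
# Direct form of the (unnormalized) squared Hellinger distance.
direct = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
```

Expanding the square in `direct` gives $\sum p + \sum q - 2 \sum \sqrt{pq}$, which is exactly the kernel expression, so the two quantities agree up to rounding.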

A different approach is based on defining kernels of the form

$$K(\rho, \rho') = \int\!\!\int d\rho(x)\, d\rho'(x')\, k(x, x'), \qquad (26)$$

where $k : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a PD kernel. Using the reproducing property of $k$ we
can write

$$K(\rho, \rho') = \Big\langle \int d\rho(x)\, k_x, \int d\rho'(x')\, k_{x'} \Big\rangle_k = \langle \Phi(\rho), \Phi(\rho') \rangle_k$$

where $\Phi : \mathcal{P}(\mathbb{R}) \to \mathcal{H}$ is the embedding $\Phi(\rho) = \int d\rho(x')\, k_{x'}$ mapping each
distribution in a corresponding kernel mean, see e.g. [7]. Conditions on the kernel $k$,
hence on $K$, ensuring that the corresponding function $d_K$ is a metric have been
studied in detail, see e.g. [55].
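When the distributions are only available through samples, (26) can be approximated by averaging the base kernel over pairs of sample points. The sketch below assumes a Gaussian base kernel and empirical measures, neither of which is prescribed by the text; the resulting squared distance between kernel means is often called the maximum mean discrepancy. The names `gaussian_k`, `mean_kernel` and `mmd_sq` are hypothetical.

```python
import math


def gaussian_k(x, y, sigma=1.0):
    """Gaussian PD kernel on the real line (an illustrative choice of k)."""
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))


def mean_kernel(xs, ys, k=gaussian_k):
    """Empirical version of (26): average of k(x, x') over all sample pairs."""
    return sum(k(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_sq(xs, ys, k=gaussian_k):
    """Squared distance between the kernel means of the two empirical measures."""
    return mean_kernel(xs, xs, k) + mean_kernel(ys, ys, k) - 2 * mean_kernel(xs, ys, k)


xs = [0.0, 0.1, -0.2, 0.3]  # sample concentrated near 0
ys = [2.0, 2.1, 1.8, 2.4]   # sample concentrated near 2

same = mmd_sq(xs, xs)
apart = mmd_sq(xs, ys)
```

The distance of a sample to itself vanishes, while two well separated samples are far apart in the embedding, which is the property needed to use such an embedding as the first step of a second layer.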